ABSTRACT
This paper presents a new method for topic-based document segmentation, i.e., the identification of boundaries between parts of a document that address different topics. The method combines the Probabilistic Latent Semantic Analysis (PLSA) model with a procedure that selects segmentation points from the similarity values between pairs of adjacent blocks. PLSA yields a better representation of the sparse information in a short text block, such as a sentence or a short sequence of sentences. Segmentation performance is further improved by combining different instantiations of the same model, obtained either from different random initializations or from different numbers of latent classes. Results on commonly used data sets are significantly better than those of other state-of-the-art systems.
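The pipeline described above can be sketched in a few lines: fit a PLSA model by EM on a block-by-word count matrix, represent each block by its topic distribution P(z|d), and hypothesize a boundary wherever the similarity between adjacent blocks dips. This is a minimal illustration, not the paper's implementation: the EM routine, the cosine similarity measure, and the fixed `threshold` for boundary selection are simplifying assumptions (the paper selects segmentation points from the similarity profile itself).

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit a PLSA model by EM on a block-by-word count matrix.
    Returns P(topic|block) and P(word|topic)."""
    rng = np.random.default_rng(seed)
    n_blocks, n_words = counts.shape
    p_z_d = rng.random((n_blocks, n_topics))       # P(z|d), random init
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))        # P(w|z), random init
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: P(z|d,w) proportional to P(z|d) * P(w|z)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]     # shape (d, z, w)
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate from expected counts
        weighted = counts[:, None, :] * joint
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

def boundaries(p_z_d, threshold=0.5):
    """Propose a boundary after block i when the cosine similarity
    of adjacent topic vectors P(z|d) falls below the threshold."""
    sims = [
        float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        for a, b in zip(p_z_d[:-1], p_z_d[1:])
    ]
    return [i for i, s in enumerate(sims) if s < threshold], sims
```

Combining instantiations, as the abstract proposes, would amount to averaging the similarity profile `sims` over several runs of `plsa` with different seeds or topic counts before selecting boundaries.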
Topic-based document segmentation with probabilistic latent semantic analysis